Introduction

This is the cleaned version of the regression analysis for my STA302 final project; my old Markdown file had accumulated too much clutter. I will document my data and create graphs here to stay organized while writing my final report.

We will work with two datasets: savant_data.csv, which contains the bulk of our working data courtesy of Major League Baseball’s Statcast, and botcmd.csv, which contains command data courtesy of Cameron Grove’s PitchingBot model on FanGraphs.

Data Processing

Let’s begin by loading in the necessary libraries and cleaning our data.

# Libraries
library(car)
library(tidyverse)

# Loading and cleaning data
fb <- read_csv("savant_data.csv")
fb <- fb %>% select(player_id, total, `downward movement w/ gravity (in)`, 
                    `glove/arm-side movement (in)`, whiff_percent, `pitch (MPH)`,
                    `spin (RPM)`, `vertical release pt (ft)`, `extension (ft)`)

bot <- read_csv("botcmd.csv") %>% select(player_id, `botCmd FA`)

# Merging and cleaning data
fastballs <- merge(fb, bot, by = 'player_id')
renamed_cols <- c('downwards_mvmt' = 'downward movement w/ gravity (in)',
                  'side_mvmt' = 'glove/arm-side movement (in)',
                  'speed' = 'pitch (MPH)',
                  'spin' = 'spin (RPM)',
                  'release_pt' = 'vertical release pt (ft)',
                  'extension' = 'extension (ft)',
                  'command' = 'botCmd FA')

fastballs <- rename(fastballs, all_of(renamed_cols))

Now we will preview our data:

head(fastballs)
##   player_id total downwards_mvmt side_mvmt whiff_percent speed spin release_pt
## 1    425844  2219           18.2       2.9          15.9  89.5 2248       6.29
## 2    434378  2602           12.2       8.0          18.1  94.3 2428       7.03
## 3    448179  2531           17.6       6.3          21.8  88.4 2203       5.91
## 4    450203  2842           19.5      13.3          22.5  94.9 2307       5.45
## 5    453286  2465           15.5      10.8          23.8  93.7 2360       5.48
## 6    458681  3167           18.4       3.2          31.6  92.4 2424       5.64
##   extension command
## 1       6.0      62
## 2       6.0      49
## 3       6.3      49
## 4       6.2      44
## 5       6.3      59
## 6       6.5      56

Regression

Now we will begin training different models. We will split our data 80-20 into training and validation sets.

set.seed(1000)
fastballs_training <- fastballs %>% sample_frac(size = 0.8)
fastballs_validation <- anti_join(fastballs, fastballs_training, by = 'player_id')

EDA

summarize(fastballs_training, pitchers = n(), 
          `min speed` = min(speed), `max speed` = max(speed),
          `average speed` = mean(speed), `median speed` = median(speed), 
          `sd speed` = sd(speed))
##   pitchers min speed max speed average speed median speed sd speed
## 1      120      88.4      99.1      94.33083         94.5 2.140753

Model Selection

We will proceed by backward elimination, so we will create our first model using every available predictor.

Model 1

reg_all <- lm(whiff_percent ~ speed + spin + downwards_mvmt + 
              side_mvmt + release_pt + extension + command, 
              data = fastballs_training)

summary(reg_all)
## 
## Call:
## lm(formula = whiff_percent ~ speed + spin + downwards_mvmt + 
##     side_mvmt + release_pt + extension + command, data = fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2833  -2.3264   0.0322   2.8411  10.6880 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -8.141909  28.986523  -0.281 0.779316    
## speed           0.401817   0.238051   1.688 0.094205 .  
## spin            0.007566   0.002817   2.686 0.008339 ** 
## downwards_mvmt -0.544637   0.216371  -2.517 0.013246 *  
## side_mvmt       0.263241   0.131540   2.001 0.047787 *  
## release_pt     -4.211420   1.094624  -3.847 0.000199 ***
## extension       0.786705   0.941384   0.836 0.405109    
## command         0.008784   0.049753   0.177 0.860179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.069 on 112 degrees of freedom
## Multiple R-squared:  0.3804, Adjusted R-squared:  0.3416 
## F-statistic: 9.821 on 7 and 112 DF,  p-value: 1.675e-09

Let’s check the assumptions of normality, linearity, and constant variance, and also check for multicollinearity by looking at the Variance Inflation Factors:

vif(reg_all)
##          speed           spin downwards_mvmt      side_mvmt     release_pt 
##       1.866332       1.185329       2.117692       1.380988       1.595473 
##      extension        command 
##       1.114783       1.046809

Based on the above, there may be some outliers. Every assumption appears to be satisfied except possibly linearity and multicollinearity. The former we might address through a transformation. The latter, while the VIFs do indicate some degree of multicollinearity (the first five predictors in particular, since they are a notable degree larger than 1), does not need to be addressed since every VIF is \(< 5\).
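As a sanity check on what these numbers mean, a VIF is just \(1/(1 - R^2)\) from regressing one predictor on all of the others; a sketch for speed, assuming the training data prepared above:

```r
# VIF for speed by hand: regress speed on the remaining predictors,
# then invert 1 - R^2; this should match the car::vif() value above
aux <- lm(speed ~ spin + downwards_mvmt + side_mvmt + release_pt +
            extension + command, data = fastballs_training)
1 / (1 - summary(aux)$r.squared)
```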

There are two possibilities for transformations:

1. We use Box-Cox to decide on a power transformation.
2. We use the arcsine variance-stabilizing transformation, since our response is a percentage.

Let’s see how both perform. First, to decide on a power transformation:

MASS::boxcox(reg_all)
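The plot gives a visual read, but the maximizing \(\lambda\) can also be pulled out numerically; a sketch (boxcox returns its \(\lambda\) grid and profile log-likelihoods):

```r
# Extract the lambda with the highest profile log-likelihood
bc <- MASS::boxcox(reg_all, plotit = FALSE)
bc$x[which.max(bc$y)]   # a value near 1 suggests no power transformation
```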

Based on the Box-Cox plot, because \(1\) is closest to the maximum likelihood estimate, no transformation seems necessary. Let’s proceed with the arcsine transformation and see if it’s any help.

# Arcsine transformed response
asine_whiff_percent <- asin(sqrt(fastballs_training$whiff_percent / 100))

# Transformed regression
asine_reg_all <- lm(asine_whiff_percent ~ speed + spin + downwards_mvmt + 
                    side_mvmt + release_pt + extension + command, 
                    data = fastballs_training)
summary(asine_reg_all)
## 
## Call:
## lm(formula = asine_whiff_percent ~ speed + spin + downwards_mvmt + 
##     side_mvmt + release_pt + extension + command, data = fastballs_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.169895 -0.027220 -0.000181  0.035785  0.128632 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.509e-02  3.608e-01   0.264 0.792590    
## speed           5.152e-03  2.963e-03   1.739 0.084795 .  
## spin            8.907e-05  3.506e-05   2.540 0.012451 *  
## downwards_mvmt -6.634e-03  2.693e-03  -2.463 0.015282 *  
## side_mvmt       3.403e-03  1.637e-03   2.078 0.039969 *  
## release_pt     -5.011e-02  1.362e-02  -3.678 0.000363 ***
## extension       9.376e-03  1.172e-02   0.800 0.425302    
## command         8.799e-05  6.193e-04   0.142 0.887265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05065 on 112 degrees of freedom
## Multiple R-squared:  0.3715, Adjusted R-squared:  0.3322 
## F-statistic: 9.457 on 7 and 112 DF,  p-value: 3.511e-09
par(mfrow = c(1,2))
plot(asine_reg_all)

plot(reg_all)

It actually looks a little worse. Either way, point 27, associated with Shane Bieber, is the worst offender. Let’s look at his standardized residual:

rstandard(reg_all)[27]
##        27 
## -3.105265
rstandard(asine_reg_all)[27]
##        27 
## -3.450755
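Before deciding, it’s worth confirming no other points are in the same range; a quick sketch flagging every standardized residual beyond \(\pm 3\):

```r
# Flag all observations with |standardized residual| > 3 in the full model
flagged <- which(abs(rstandard(reg_all)) > 3)
fastballs_training[flagged, 'player_id']
```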

At this range, whether or not we remove him seems to be a bit of a judgement call. Let’s try removing him and see what happens.

updated_fastballs_training <- fastballs_training %>%  filter(player_id != 669456) 
reg_all_removed <- lm(whiff_percent ~ speed + spin + downwards_mvmt + 
                      side_mvmt + release_pt + extension + command, 
                      data = updated_fastballs_training)

summary(reg_all_removed)
## 
## Call:
## lm(formula = whiff_percent ~ speed + spin + downwards_mvmt + 
##     side_mvmt + release_pt + extension + command, data = updated_fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1095  -2.3807  -0.0609   2.6417  10.7738 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.480854  28.066282   0.124  0.90152    
## speed           0.268805   0.232267   1.157  0.24963    
## spin            0.007614   0.002705   2.814  0.00578 ** 
## downwards_mvmt -0.605074   0.208616  -2.900  0.00449 ** 
## side_mvmt       0.314302   0.127298   2.469  0.01507 *  
## release_pt     -4.273711   1.051322  -4.065    9e-05 ***
## extension       0.958623   0.905555   1.059  0.29208    
## command         0.021236   0.047932   0.443  0.65860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.908 on 111 degrees of freedom
## Multiple R-squared:  0.4014, Adjusted R-squared:  0.3636 
## F-statistic: 10.63 on 7 and 111 DF,  p-value: 3.482e-10
par(mfrow = c(1,2))
plot(reg_all_removed)

# Arcsine version
arcsine_reg_all_removed <- lm(asin(sqrt(updated_fastballs_training$whiff_percent / 100)) ~
                              speed + spin + downwards_mvmt + side_mvmt + release_pt + 
                              extension + command, data = updated_fastballs_training)
summary(arcsine_reg_all_removed)
## 
## Call:
## lm(formula = asin(sqrt(updated_fastballs_training$whiff_percent/100)) ~ 
##     speed + spin + downwards_mvmt + side_mvmt + release_pt + 
##         extension + command, data = updated_fastballs_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.135826 -0.028609 -0.000492  0.032223  0.129819 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.559e-01  3.454e-01   0.741 0.460468    
## speed           3.313e-03  2.859e-03   1.159 0.249048    
## spin            8.973e-05  3.330e-05   2.695 0.008141 ** 
## downwards_mvmt -7.470e-03  2.568e-03  -2.909 0.004376 ** 
## side_mvmt       4.109e-03  1.567e-03   2.622 0.009956 ** 
## release_pt     -5.097e-02  1.294e-02  -3.939 0.000143 ***
## extension       1.175e-02  1.115e-02   1.055 0.293925    
## command         2.602e-04  5.900e-04   0.441 0.660018    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0481 on 111 degrees of freedom
## Multiple R-squared:  0.3975, Adjusted R-squared:  0.3595 
## F-statistic: 10.46 on 7 and 111 DF,  p-value: 4.861e-10
par(mfrow = c(2,2))
plot(arcsine_reg_all_removed)

Normality seems to look better in both cases, although the non-transformed data still looks best. Speed is no longer significant, and downwards movement and side movement have become more significant. Let’s proceed with the non-transformed dataset with Bieber removed.

Model 2.1, 2.2

Of the predictors that remained significant, the least significant was side movement. Let’s try removing it along with the clearly insignificant speed, extension, and command.

reg_sdr <- lm(whiff_percent ~ spin + downwards_mvmt + release_pt, 
              data = updated_fastballs_training)

summary(reg_sdr)
## 
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + release_pt, 
##     data = updated_fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6167  -2.4733   0.1696   2.3800  10.6984 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    52.451986  10.969020   4.782 5.19e-06 ***
## spin            0.006236   0.002726   2.287    0.024 *  
## downwards_mvmt -0.793945   0.170495  -4.657 8.69e-06 ***
## release_pt     -5.609603   0.985721  -5.691 9.80e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.049 on 115 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3167 
## F-statistic: 19.23 on 3 and 115 DF,  p-value: 3.571e-10

Was this okay? Let’s validate with the partial \(F\) test.
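As a reminder of what anova() computes here, the partial \(F\) statistic can be assembled by hand from the two residual sums of squares; a sketch that should agree with the anova() call below:

```r
# Partial F test by hand: penalize the increase in RSS from dropping predictors
rss_full    <- sum(residuals(reg_all_removed)^2)
rss_reduced <- sum(residuals(reg_sdr)^2)
df_dropped  <- reg_sdr$df.residual - reg_all_removed$df.residual
f_stat <- ((rss_reduced - rss_full) / df_dropped) /
  (rss_full / reg_all_removed$df.residual)
pf(f_stat, df_dropped, reg_all_removed$df.residual, lower.tail = FALSE)
```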

anova(reg_all_removed, reg_sdr)
## Analysis of Variance Table
## 
## Model 1: whiff_percent ~ speed + spin + downwards_mvmt + side_mvmt + release_pt + 
##     extension + command
## Model 2: whiff_percent ~ spin + downwards_mvmt + release_pt
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    111 1694.9                              
## 2    115 1885.7 -4   -190.74 3.1229 0.01778 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The partial \(F\) test is significant, so the removed predictors carried information and we cannot proceed with this reduction.

Let’s instead keep side movement and remove only speed, extension, and command.

reg_sdsr <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + 
                 release_pt, data = updated_fastballs_training)

summary(reg_sdsr)
## 
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt + 
##     release_pt, data = updated_fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4660  -2.2224  -0.0577   2.7538  11.0474 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    42.483699  11.034128   3.850 0.000195 ***
## spin            0.007335   0.002650   2.768 0.006586 ** 
## downwards_mvmt -0.786147   0.164290  -4.785 5.16e-06 ***
## side_mvmt       0.353517   0.112469   3.143 0.002130 ** 
## release_pt     -4.811014   0.983130  -4.894 3.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.901 on 114 degrees of freedom
## Multiple R-squared:  0.3871, Adjusted R-squared:  0.3656 
## F-statistic:    18 on 4 and 114 DF,  p-value: 1.747e-11

Let’s confirm whether we can remove speed, extension, and command using a partial \(F\) test.

anova(reg_all_removed, reg_sdsr)
## Analysis of Variance Table
## 
## Model 1: whiff_percent ~ speed + spin + downwards_mvmt + side_mvmt + release_pt + 
##     extension + command
## Model 2: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    111 1694.9                           
## 2    114 1735.3 -3   -40.348 0.8808 0.4534

Since \(p = 0.4534\), it looks like we are good to go. Let’s recheck our assumptions:

Normality still looks good, although there seems to be some fanning behavior in the residuals. Let’s re-introduce the arcsine transformation and see if it’s any help this time around.

# Arcsine version
arcsine_reg_sdsr_removed <- lm(asin(sqrt(updated_fastballs_training$whiff_percent / 100)) ~
                              spin + downwards_mvmt + side_mvmt + release_pt, 
                              data = updated_fastballs_training)

summary(arcsine_reg_sdsr_removed)
## 
## Call:
## lm(formula = asin(sqrt(updated_fastballs_training$whiff_percent/100)) ~ 
##     spin + downwards_mvmt + side_mvmt + release_pt, data = updated_fastballs_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.140190 -0.025379  0.001613  0.032034  0.133183 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.358e-01  1.358e-01   5.418 3.40e-07 ***
## spin            8.633e-05  3.262e-05   2.647  0.00928 ** 
## downwards_mvmt -9.699e-03  2.022e-03  -4.797 4.92e-06 ***
## side_mvmt       4.594e-03  1.384e-03   3.319  0.00122 ** 
## release_pt     -5.757e-02  1.210e-02  -4.758 5.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04802 on 114 degrees of freedom
## Multiple R-squared:  0.3832, Adjusted R-squared:  0.3616 
## F-statistic: 17.71 on 4 and 114 DF,  p-value: 2.488e-11
par(mfrow = c(1,2))
plot(arcsine_reg_sdsr_removed)

plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$spin,
     main = 'Spin',
     xlab = 'Spin (RPM)', ylab = 'Residuals',
     col = 'orange')
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$downwards_mvmt,
     main = 'Downwards Movement',
     xlab = 'Downwards Movement (in)', ylab = 'Residuals',
     col = 'goldenrod')

plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$side_mvmt,
     main = 'Side Movement',
     xlab = 'Side Movement (in)', ylab = 'Residuals',
     col = 'darkseagreen')
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$release_pt,
     main = 'Release Point',
     xlab = 'Release Point (ft)', ylab = 'Residuals',
     col = 'cadetblue')

If anything, it looks worse. How about weighted least squares (WLS)?

# Estimate weights as inverse squared fitted values from regressing
# |residuals| on the fitted values
weights <- 1 / lm(abs(reg_sdsr$residuals) ~ reg_sdsr$fitted.values)$fitted.values^2

wls_reg_sdsr <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
                data = updated_fastballs_training, weights = weights)

summary(wls_reg_sdsr)
## 
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt + 
##     release_pt, data = updated_fastballs_training, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4502 -0.7408  0.0046  0.9206  3.5077 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.332631  11.089197   3.998 0.000114 ***
## spin            0.007119   0.002647   2.689 0.008241 ** 
## downwards_mvmt -0.805156   0.166424  -4.838 4.14e-06 ***
## side_mvmt       0.345446   0.112552   3.069 0.002683 ** 
## release_pt     -4.984196   0.976764  -5.103 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.292 on 114 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.3714 
## F-statistic: 18.43 on 4 and 114 DF,  p-value: 1.049e-11

It looks like this model does the best job of not violating assumptions. Let’s check the last one, multicollinearity:

vif(wls_reg_sdsr)
##           spin downwards_mvmt      side_mvmt     release_pt 
##       1.153141       1.353217       1.096323       1.425529

Which appears to still be fine.

Models 3 & 4 (?)

Can we go simpler? Let’s pick the two most significant parameters.

reg_dr <- lm(whiff_percent ~ downwards_mvmt + release_pt,
                data = updated_fastballs_training)
summary(reg_dr)
## 
## Call:
## lm(formula = whiff_percent ~ downwards_mvmt + release_pt, data = updated_fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7028  -2.5705   0.0924   2.5525  10.2306 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     71.8809     7.0658  10.173  < 2e-16 ***
## downwards_mvmt  -0.9084     0.1659  -5.475 2.57e-07 ***
## release_pt      -6.2082     0.9675  -6.417 3.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.123 on 116 degrees of freedom
## Multiple R-squared:  0.3037, Adjusted R-squared:  0.2917 
## F-statistic:  25.3 on 2 and 116 DF,  p-value: 7.603e-10

Now let’s conduct a partial \(F\) test to see if this was okay.

anova(wls_reg_sdsr, reg_dr)
## Analysis of Variance Table
## 
## Model 1: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
## Model 2: whiff_percent ~ downwards_mvmt + release_pt
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    114  190.35                                  
## 2    116 1971.45 -2   -1781.1 533.34 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which is very significant, so it appears at least one of spin and side movement matters. (Strictly speaking, comparing the weighted fit wls_reg_sdsr against an unweighted model mixes weighted and unweighted RSS, which inflates the \(F\) statistic, but the individual \(t\)-tests above point the same way.) Let’s take them up individually.

reg_sdr <- lm(whiff_percent ~ spin + downwards_mvmt + release_pt,
                data = updated_fastballs_training)
summary(reg_sdr)
## 
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + release_pt, 
##     data = updated_fastballs_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6167  -2.4733   0.1696   2.3800  10.6984 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    52.451986  10.969020   4.782 5.19e-06 ***
## spin            0.006236   0.002726   2.287    0.024 *  
## downwards_mvmt -0.793945   0.170495  -4.657 8.69e-06 ***
## release_pt     -5.609603   0.985721  -5.691 9.80e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.049 on 115 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3167 
## F-statistic: 19.23 on 3 and 115 DF,  p-value: 3.571e-10
anova(wls_reg_sdsr, reg_sdr)
## Analysis of Variance Table
## 
## Model 1: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
## Model 2: whiff_percent ~ spin + downwards_mvmt + release_pt
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    114  190.35                                  
## 2    115 1885.66 -1   -1695.3 1015.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So side movement was important. What if we remove spin instead?

reg_dsr <- lm(whiff_percent ~ downwards_mvmt + side_mvmt + release_pt,
                data = updated_fastballs_training)
summary(reg_dsr)
## 
## Call:
## lm(formula = whiff_percent ~ downwards_mvmt + side_mvmt + release_pt, 
##     data = updated_fastballs_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6596 -2.3802 -0.1801  2.7471 10.4662 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     66.0962     7.1981   9.182 2.08e-15 ***
## downwards_mvmt  -0.9194     0.1616  -5.690 9.83e-08 ***
## side_mvmt        0.3125     0.1147   2.725  0.00744 ** 
## release_pt      -5.5956     0.9683  -5.779 6.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.013 on 115 degrees of freedom
## Multiple R-squared:  0.346,  Adjusted R-squared:  0.3289 
## F-statistic: 20.28 on 3 and 115 DF,  p-value: 1.285e-10
anova(wls_reg_sdsr, reg_dsr)

Which is also significant, so spin matters as well. It seems we cannot go simpler.

Selecting a model

Let’s now compare our original model to our simplified model. We will use corrected AIC (appropriate here since \(n/K\) is small: even for the reduced model, \(\frac{n}{K} = \frac{119}{6} \approx 19.8 < 40\)), BIC, and adjusted \(R^2\) to decide.
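The correction applied below is \(\mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K+1)}{n - K - 1}\), where \(K\) counts every estimated parameter (regression coefficients plus the error variance). As a sketch, a small helper that computes it for any lm fit:

```r
# Corrected AIC for small samples: AICc = AIC + 2K(K+1) / (n - K - 1)
aicc <- function(model) {
  n <- length(residuals(model))
  k <- length(coef(model)) + 1   # coefficients plus the variance estimate
  AIC(model) + (2 * k * (k + 1)) / (n - k - 1)
}
```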

models = c('All', 'Spin + Downwards Movement + Sideways Movement + Release Point')

n_predictors = c(7, 4)

adj_r_squared = c(summary(reg_all_removed)$adj.r.squared,
              summary(wls_reg_sdsr)$adj.r.squared)

bic = c(BIC(reg_all_removed), 
        BIC(wls_reg_sdsr))

aic_corrected = c(AIC(reg_all_removed) + (2*9*10)/(119-9-1), 
                AIC(wls_reg_sdsr) + (2*6*7)/(119-6-1))

selection <- tibble(models, n_predictors, adj_r_squared, aic_corrected, bic)
selection
## # A tibble: 2 × 5
##   models                                           n_pre…¹ adj_r…² aic_c…³   bic
##   <chr>                                              <dbl>   <dbl>   <dbl> <dbl>
## 1 All                                                    7   0.364    673.  697.
## 2 Spin + Downwards Movement + Sideways Movement …        4   0.371    669.  685.
## # … with abbreviated variable names ¹​n_predictors, ²​adj_r_squared,
## #   ³​aic_corrected

So by all counts, our second model is better. We’ll pick it and move on to our final step: validation.

Model Validation

reg_sdsr_validation <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
                          data = fastballs_validation)

# summary(reg_sdsr_validation)

weights_val <- 1 / lm(abs(reg_sdsr_validation$residuals) ~ reg_sdsr_validation$fitted.values)$fitted.values^2

wls_reg_sdsr_val <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
                data = fastballs_validation, weights = weights_val)

summary(wls_reg_sdsr_val)
## 
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt + 
##     release_pt, data = fastballs_validation, weights = weights_val)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12604 -0.94320  0.05419  0.89935  3.01236 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    69.387066  30.152078   2.301  0.02999 * 
## spin            0.005538   0.008182   0.677  0.50473   
## downwards_mvmt -1.119305   0.463606  -2.414  0.02341 * 
## side_mvmt      -0.134745   0.329510  -0.409  0.68608   
## release_pt     -7.124670   2.508304  -2.840  0.00883 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.325 on 25 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.2649 
## F-statistic: 3.612 on 4 and 25 DF,  p-value: 0.01863
plot(wls_reg_sdsr_val)

par(mfrow = c(1,2))

plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$spin,
     main = 'Spin',
     xlab = 'Spin (RPM)', ylab = 'Residuals',
     col = 'orange')
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$downwards_mvmt,
     main = 'Downwards Movement',
     xlab = 'Downwards Movement (in)', ylab = 'Residuals',
     col = 'goldenrod')

plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$side_mvmt,
     main = 'Side Movement',
     xlab = 'Side Movement (in)', ylab = 'Residuals',
     col = 'darkseagreen')
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$release_pt,
     main = 'Release Point',
     xlab = 'Release Point (ft)', ylab = 'Residuals',
     col = 'cadetblue')

# plot(reg_sdsr_validation)

It looks like point 15 (Felix Bautista) has high leverage in this dataset, and points 12 and 2 are close behind. There also appear to be quite a few meaningful deviations from the normal distribution; in particular, points 17 (Alex Faedo) and 15 (Felix Bautista) seem to be pretty far off.

Let’s compare with our training regression:

models = c('Training', 'Validation')

n = c(119, 30)

adj_r_squared = c(summary(wls_reg_sdsr)$adj.r.squared,
                  summary(wls_reg_sdsr_val)$adj.r.squared)

intercept_estimate <- c(wls_reg_sdsr$coefficients[1],
                        wls_reg_sdsr_val$coefficients[1])

spin_estimate <- c(wls_reg_sdsr$coefficients[2],
                        wls_reg_sdsr_val$coefficients[2])

downwards_mvmt_estimate <- c(wls_reg_sdsr$coefficients[3],
                        wls_reg_sdsr_val$coefficients[3])

side_mvmt_estimate <- c(wls_reg_sdsr$coefficients[4],
                        wls_reg_sdsr_val$coefficients[4])

release_pt_estimate <- c(wls_reg_sdsr$coefficients[5],
                        wls_reg_sdsr_val$coefficients[5])

validation <- tibble(models, n, adj_r_squared, spin_estimate, downwards_mvmt_estimate,
                    side_mvmt_estimate, release_pt_estimate)
validation
## # A tibble: 2 × 7
##   models         n adj_r_squared spin_estimate downwards_mvmt_…¹ side_…² relea…³
##   <chr>      <dbl>         <dbl>         <dbl>             <dbl>   <dbl>   <dbl>
## 1 Training     119         0.371       0.00712            -0.805   0.345   -4.98
## 2 Validation    30         0.265       0.00554            -1.12   -0.135   -7.12
## # … with abbreviated variable names ¹​downwards_mvmt_estimate,
## #   ²​side_mvmt_estimate, ³​release_pt_estimate

So besides the spin estimate, the coefficients really aren’t too similar. Let’s also check multicollinearity to confirm it remains a non-issue.

vif(wls_reg_sdsr_val)
##           spin downwards_mvmt      side_mvmt     release_pt 
##       1.112659       1.394939       1.649884       1.980719

Let’s take a look at how whiff rates are distributed in each dataset:

summarize(fastballs_validation, pitchers = n(), 
          `min whiff` = min(whiff_percent), `max whiff` = max(whiff_percent),
          `average whiff` = mean(whiff_percent), `median whiff` = median(whiff_percent), 
          `sd whiff` = sd(whiff_percent))
##   pitchers min whiff max whiff average whiff median whiff sd whiff
## 1       30      12.4      37.8      22.86667        21.15 6.428725
summarize(updated_fastballs_training, pitchers = n(), 
          `min whiff` = min(whiff_percent), `max whiff` = max(whiff_percent),
          `average whiff` = mean(whiff_percent), `median whiff` = median(whiff_percent), 
          `sd whiff` = sd(whiff_percent))
##   pitchers min whiff max whiff average whiff median whiff sd whiff
## 1      119       9.7      31.5      22.06303         21.7  4.89851

Let’s also look at the number of total fastballs thrown by each group.

mean(updated_fastballs_training$total)
## [1] 1981.277
mean(fastballs_validation$total)
## [1] 1884.667

Looks like the validation group, on average, threw about 100 fewer fastballs than the training group. This might explain some of the deviations.